I

1

The simple linear regression model is given by: \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \]

for observations \((Y_i, X_i)\).

In matrix form, we write: \[ \textbf{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \end{pmatrix} \]

\[ \mathbf{X} = \begin{pmatrix} 1 & X_1 \\ 1 & X_2 \\ 1 & X_3 \\ 1 & X_4 \\ 1 & X_5 \end{pmatrix} \]

\[ \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} \]

The ordinary least squares (OLS) estimator of \(\boldsymbol{\beta}\) is: \[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y} \]

For the following data: \[ \begin{array}{c|c|c|c|c|c} i & 1 & 2 & 3 & 4 & 5 \\ \hline Y_i & -0.1 & 2.9 & 6.2 & 7.3 & 10.7 \\ X_i & 1 & 3 & 5 & 7 & 9 \\ \end{array} \]

\[ \mathbf{Y} = \begin{pmatrix} -0.1 \\ 2.9 \\ 6.2 \\ 7.3 \\ 10.7 \end{pmatrix} \] \[ \mathbf{X} = \begin{pmatrix} 1 & 1 \\ 1 & 3 \\ 1 & 5 \\ 1 & 7 \\ 1 & 9 \end{pmatrix} \]

\[ \mathbf{X}^T \mathbf{X} = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 3 & 5 & 7 & 9 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & 3 \\ 1 & 5 \\ 1 & 7 \\ 1 & 9 \end{pmatrix} = \begin{pmatrix} 5 & 25 \\ 25 & 165 \end{pmatrix} \]

\[ (\mathbf{X}^T \mathbf{X})^{-1} = \frac{1}{5 \cdot 165 - 25 \cdot 25} \begin{pmatrix} 165 & -25 \\ -25 & 5 \end{pmatrix} = \frac{1}{825 - 625} \begin{pmatrix} 165 & -25 \\ -25 & 5 \end{pmatrix} = \frac{1}{200} \begin{pmatrix} 165 & -25 \\ -25 & 5 \end{pmatrix} = \begin{pmatrix} 0.825 & -0.125 \\ -0.125 & 0.025 \end{pmatrix} \]

\[ \mathbf{X}^T \mathbf{Y} = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 3 & 5 & 7 & 9 \end{pmatrix} \begin{pmatrix} -0.1 \\ 2.9 \\ 6.2 \\ 7.3 \\ 10.7 \end{pmatrix} = \begin{pmatrix} 27 \\ 187 \end{pmatrix} \]

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y} = \begin{pmatrix} 0.825 & -0.125 \\ -0.125 & 0.025 \end{pmatrix} \begin{pmatrix} 27 \\ 187 \end{pmatrix} = \begin{pmatrix} 0.825 \cdot 27 + (-0.125) \cdot 187 \\ (-0.125) \cdot 27 + 0.025 \cdot 187 \end{pmatrix} = \begin{pmatrix} -1.1 \\ 1.3 \end{pmatrix} \]

The OLS gives: \[ \hat{\beta}_0 = -1.1, \quad \hat{\beta}_1 = 1.3 \]

The fitted line is: \[ \hat{Y}_i = -1.1 + 1.3 X_i \] Residual MSE: 0.413.

Variance-Covariance Matrix: \[ \mathbf{X} = \begin{pmatrix} 0.341 & -0.0517 \\ -0.0517 & 0.0103 \end{pmatrix} \] Predicted Value: \[ \hat{\boldsymbol{Y}} = \begin{pmatrix} 0.2 & 2.8 & 5.4 & 8.0 & 10.6 \end{pmatrix}^T \] Residuals: \[ \begin{pmatrix} -0.3 & 0.1 & 0.8 & -0.7 & 0.1 \end{pmatrix}^T \]

\(\hat{\epsilon}^2 = \sum_{i=1}^{5}(Y_i-\hat{Yi})^2=\sum_{i=1}^{5}(Y_i-[-1.1 + 1.3 X_i])^2 = (-0.3)^2 + 0.1^2 + 0.8^2 + (-0.7)^2 +0.1^2 = 1.24\) \(RMSE = \hat{\epsilon}^2/(n-p) = 1.24/(5-2) = 0.413\)

2

simple_linear_regression <- function(Y, X) {
  X <- cbind(1, X)
  XTX <- t(X) %*% X
  XTX_inv <- solve(XTX)
  XTY <- t(X) %*% Y
  
  # the least squares estimates 
  beta_hat <- XTX_inv %*% XTY
  Y_hat <- X %*% beta_hat
  residuals <- Y - Y_hat

  n <- length(Y)
  p <- ncol(X)
  sigma2 <- as.numeric((t(residuals) %*% residuals) / (n - p))
  
  # Calculate the variance-covariance matrix of the least squares estimates
  var_cov_matrix <- sigma2 * XTX_inv

  list(
    beta_hat = beta_hat,
    sigma2 = sigma2,
    var_cov_matrix = var_cov_matrix,
    Y_hat = Y_hat,
    residuals = residuals
  )
}


Y <- c(-0.1, 2.9, 6.2, 7.3, 10.7)
X <- c(1, 3, 5, 7, 9)
data <- data.frame(Y=Y,X=X)

results <- simple_linear_regression(Y, X)
print(results)
## $beta_hat
##   [,1]
##   -1.1
## X  1.3
## 
## $sigma2
## [1] 0.4133333
## 
## $var_cov_matrix
##                         X
##    0.34100000 -0.05166667
## X -0.05166667  0.01033333
## 
## $Y_hat
##      [,1]
## [1,]  0.2
## [2,]  2.8
## [3,]  5.4
## [4,]  8.0
## [5,] 10.6
## 
## $residuals
##      [,1]
## [1,] -0.3
## [2,]  0.1
## [3,]  0.8
## [4,] -0.7
## [5,]  0.1
summary(lm(Y~X, data = data))
## 
## Call:
## lm(formula = Y ~ X, data = data)
## 
## Residuals:
##    1    2    3    4    5 
## -0.3  0.1  0.8 -0.7  0.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -1.1000     0.5840  -1.884  0.15612   
## X             1.3000     0.1017  12.789  0.00103 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6429 on 3 degrees of freedom
## Multiple R-squared:  0.982,  Adjusted R-squared:  0.976 
## F-statistic: 163.5 on 1 and 3 DF,  p-value: 0.001032

In summary: Coefficients: \(\hat{\beta}_0 = -1.1, \quad \hat{\beta}_1 = 1.3\), RMSE = 0.413, which agree with the manually computed estimates.

II

a

simulate_betas <- function(n) {
  X1 <- rexp(n, rate = 5)
  X2 <- 0.2 * X1 + rnorm(n, mean = 0, sd = 0.2)
  epsilon <- scale(rchisq(n, df = 2))
  Y <- 1.5 + 1 * X1 - 0.25 * X2 + epsilon
  
  fit <- lm(Y ~ X1 + X2)
  return(coef(fit)[c("X1", "X2")])
}

run_simulation <- function(n, num_simulations) {
  results <- matrix(NA, nrow = num_simulations, ncol = 2)
  
  for (i in 1:num_simulations) {
    results[i, ] <- simulate_betas(n)
  }
  return(results)
}

b

set.seed(13)

n <- 100
num_simulations <- 2000
results_100 <- run_simulation(n, num_simulations)

# Plot histograms for beta1_hat and beta2_hat
hist_beta1_100 <- ggplot(data.frame(beta1_hat = results_100[, 1]), aes(x = beta1_hat)) +
  geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Histogram of beta1_hat (n = 100)", x = "beta1_hat", y = "Frequency") + xlim(-5, 5)

hist_beta2_100 <- ggplot(data.frame(beta2_hat = results_100[, 2]), aes(x = beta2_hat)) +
  geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Histogram of beta2_hat (n = 100)", x = "beta2_hat", y = "Frequency") + xlim(-5, 5)


qqplot_beta1_100 <- ggplot(data.frame(beta1_hat = results_100[, 1]), aes(sample = beta1_hat)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal Probability Plot of beta1_hat (n = 100)", x = "Theoretical Quantiles", y = "Sample Quantiles")

qqplot_beta2_100 <- ggplot(data.frame(beta2_hat = results_100[, 2]), aes(sample = beta2_hat)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal Probability Plot of beta2_hat (n = 100)", x = "Theoretical Quantiles", y = "Sample Quantiles")


print(hist_beta1_100)

print(hist_beta2_100)

print(qqplot_beta1_100)

print(qqplot_beta2_100)

When n = 100, the sampling distributions for \(\hat{\beta_1}\) and \(\hat{\beta_2}\) are approximately normal as the histograms are bell-shaped. Also, the qqplots show that the data have approximately equal sample quantiles and theoretical quantiles.

c

set.seed(14)
n <- 15
num_simulations <- 2000

results_15 <- run_simulation(n, num_simulations)
hist_beta1_15 <- ggplot(data.frame(beta1_hat = results_15[, 1]), aes(x = beta1_hat)) +
  geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Histogram of beta1_hat (n = 15)", x = "beta1_hat", y = "Frequency") + xlim(-5, 5)

hist_beta2_15 <- ggplot(data.frame(beta2_hat = results_15[, 2]), aes(x = beta2_hat)) +
  geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Histogram of beta2_hat (n = 15)", x = "beta2_hat", y = "Frequency") + xlim(-5, 5)


qqplot_beta1_15 <- ggplot(data.frame(beta1_hat = results_15[, 1]), aes(sample = beta1_hat)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal Probability Plot of beta1_hat (n = 15)", x = "Theoretical Quantiles", y = "Sample Quantiles")

qqplot_beta2_15 <- ggplot(data.frame(beta2_hat = results_100[, 2]), aes(sample = beta2_hat)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal Probability Plot of beta2_hat (n = 15)", x = "Theoretical Quantiles", y = "Sample Quantiles")


print(hist_beta1_15)
## Warning: Removed 48 rows containing non-finite outside the scale range
## (`stat_bin()`).

print(hist_beta2_15)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).

print(qqplot_beta1_15)

print(qqplot_beta2_15)

Compared to when n = 100, the histograms of the \(\hat{\beta_1}\) and \(\hat{\beta_2}\) have heavier tails, indicating larger variance. Also, the qqplot of \(\hat{\beta_1}\) shows more data points deviate from their theoretical quantiles, meaning that the sampling distribution has larger variance and deviates from normal distribution when the sample size is not large.

d

simulate_betas_d <- function(n) {
  
  X1 <- rexp(n, rate = 5)
  X2 <- 0.2 * X1 + rnorm(n, mean = 0, sd = 0.2)
  epsilon <- runif(n, -2, 2)
  Y <- 1.5 + 1 * X1 - 0.25 * X2 + epsilon
  
  fit <- lm(Y ~ X1 + X2)
  return(coef(fit)[c("X1", "X2")])
}

run_simulation_d <- function(n, num_simulations) {
  results <- matrix(NA, nrow = num_simulations, ncol = 2)
  
  for (i in 1:num_simulations) {
    results[i, ] <- simulate_betas_d(n)
  }
  return(results)
}
set.seed(13)
n <- 100
num_simulations <- 2000
results_100_d <- run_simulation_d(n, num_simulations)

hist_beta1_100_d <- ggplot(data.frame(beta1_hat = results_100_d[, 1]), aes(x = beta1_hat)) +
  geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Histogram of beta1_hat (n = 100)", x = "beta1_hat", y = "Frequency") + xlim(-5, 5)

hist_beta2_100_d <- ggplot(data.frame(beta2_hat = results_100_d[, 2]), aes(x = beta2_hat)) +
  geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Histogram of beta2_hat (n = 100)", x = "beta2_hat", y = "Frequency") + xlim(-5, 5)


qqplot_beta1_100_d <- ggplot(data.frame(beta1_hat = results_100_d[, 1]), aes(sample = beta1_hat)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal Probability Plot of beta1_hat (n = 100)", x = "Theoretical Quantiles", y = "Sample Quantiles")

qqplot_beta2_100_d <- ggplot(data.frame(beta2_hat = results_100_d[, 2]), aes(sample = beta2_hat)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal Probability Plot of beta2_hat (n = 100)", x = "Theoretical Quantiles", y = "Sample Quantiles")


print(hist_beta1_100_d)

print(hist_beta2_100_d)

print(qqplot_beta1_100_d)

print(qqplot_beta2_100_d)

set.seed(13)
n <- 15
num_simulations <- 2000

results_15_d <- run_simulation_d(n, num_simulations)
hist_beta1_15_d <- ggplot(data.frame(beta1_hat = results_15_d[, 1]), aes(x = beta1_hat)) +
  geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Histogram of beta1_hat (n = 15)", x = "beta1_hat", y = "Frequency")

hist_beta2_15_d <- ggplot(data.frame(beta2_hat = results_15_d[, 2]), aes(x = beta2_hat)) +
  geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Histogram of beta2_hat (n = 15)", x = "beta2_hat", y = "Frequency")


qqplot_beta1_15_d <- ggplot(data.frame(beta1_hat = results_15_d[, 1]), aes(sample = beta1_hat)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal Probability Plot of beta1_hat (n = 15)", x = "Theoretical Quantiles", y = "Sample Quantiles")

qqplot_beta2_15_d <- ggplot(data.frame(beta2_hat = results_15_d[, 2]), aes(sample = beta2_hat)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal Probability Plot of beta2_hat (n = 15)", x = "Theoretical Quantiles", y = "Sample Quantiles")


print(hist_beta1_15_d)

print(hist_beta2_15_d)

print(qqplot_beta1_15_d)

print(qqplot_beta2_15_d)

As I change the residuals distribution to be Uniform(-2, 2), we see that when n = 100, the sampling distributions are still approximately normal and the variance of the distribution is similar to the variance in b). Compared to when n = 100, the histograms of the \(\hat{\beta_1}\) and \(\hat{\beta_2}\) have heavier tails (especially for \(\hat{\beta_1}\)), indicating larger variance.s Also, the qqplot of \(\hat{\beta_1}\) shows more data points deviate from their theoretical quantiles, meaning that the sampling distribution deviate from normal distribution when the sample size is not large.

e

With a sufficiently large sample size (n = 100), the sampling distributions of \(\hat{\beta_1}\) and \(\hat{\beta_2}\) are approximately normal, even if the residuals are not normally distributed. However, for smaller sample sizes (n = 15) where the central limit theorem does not apply, the normality assumption is less robust, and the sampling distributions show more deviation from normality and have larger variances.

2

a

simulate_betas <- function(n) {
  X1 <- rexp(n, rate = 5)
  X2 <- 0.2 * X1 + rnorm(n, mean = 0, sd = 0.2)
  epsilon <- rnorm(n)
  Y <- 1.5 + 1 * X1 - 0.25 * X2 + epsilon

  fit <- lm(Y ~ X1 + X2)
  beta_hat_1 <- coef(fit)["X1"]
  beta_hat_2 <- coef(fit)["X2"]
  
  vcov_fit <- vcov(fit)
  s_cov_beta_hat_1_2 <- vcov_fit[2, 3]
  s_corr_beta_hat_1_2 <- vcov_fit[2, 3] / (sqrt(vcov_fit[2, 2])* sqrt(vcov_fit[3, 3]))
  
  fit_unadj_X1 <- lm(Y ~ X1)
  fit_unadj_X2 <- lm(Y ~ X2)
  beta_hat_unadj_1 <- coef(fit_unadj_X1)["X1"]
  beta_hat_unadj_2 <- coef(fit_unadj_X2)["X2"]

  list(
    beta_hat_1 = beta_hat_1,
    beta_hat_2 = beta_hat_2,
    s_cov_beta_hat_1_2 = s_cov_beta_hat_1_2,
    s_corr_beta_hat_1_2 = s_corr_beta_hat_1_2,
    beta_hat_unadj_1 = beta_hat_unadj_1,
    beta_hat_unadj_2 = beta_hat_unadj_2
  )
}

b

num_simulations <- 2000
n <- 100
set.seed(13)
beta_hat_1_mlr <- numeric(num_simulations)
beta_hat_2_mlr <- numeric(num_simulations)
beta_hat_1_unadj <- numeric(num_simulations)
beta_hat_2_unadj <- numeric(num_simulations)


for (i in 1:num_simulations) {
  results <- simulate_betas(n)
  beta_hat_1_mlr[i] <- results$beta_hat_1
  beta_hat_2_mlr[i] <- results$beta_hat_2
  beta_hat_1_unadj[i] <- results$beta_hat_unadj_1
  beta_hat_2_unadj[i] <- results$beta_hat_unadj_2
}

avg_beta_hat_1_mlr <- mean(beta_hat_1_mlr)
avg_beta_hat_2_mlr <- mean(beta_hat_2_mlr)
avg_beta_hat_1_unadj <- mean(beta_hat_1_unadj)
avg_beta_hat_2_unadj <- mean(beta_hat_2_unadj)
true_beta_1 <- 1
true_beta_2 <- -0.25

bias_beta_1 <- avg_beta_hat_1_unadj - true_beta_1

cat("True beta_1:", true_beta_1, "\n")
## True beta_1: 1
cat("True beta_2:", true_beta_2, "\n")
## True beta_2: -0.25
cat("Average beta_hat_1 from MLR:", avg_beta_hat_1_mlr, "\n")
## Average beta_hat_1 from MLR: 0.9856649
cat("Average beta_hat_2 from MLR:", avg_beta_hat_2_mlr, "\n")
## Average beta_hat_2 from MLR: -0.2352616
cat("Average beta_hat_1 from unadjusted SLR:", avg_beta_hat_1_unadj, "\n")
## Average beta_hat_1 from unadjusted SLR: 0.9384525
cat("Average beta_hat_2 from unadjusted SLR:", avg_beta_hat_2_unadj, "\n")
## Average beta_hat_2 from unadjusted SLR: -0.05042763
cat("Bias in estimating beta_1 if ignoring the confounding:", bias_beta_1, "\n")
## Bias in estimating beta_1 if ignoring the confounding: -0.06154748

Compared to the SLR estimates \(\hat{\beta_1} = 0.938\), \(\hat{\beta_2} = -0.050\), the MLR estimates: \(\hat{\beta_1} = 0.986\), \(\hat{\beta_2} = -0.235\) are closer to the true \({\beta_1} = 1\), \({\beta_2} = -0.25\). If ignoring the confounding variable \(X_2\), we would underestimate the positive relationship between \(Y\) and \(X_1\) (bias = -0.062).

c

simulate_betas_independent <- function(n) {

  X1 <- rexp(n, rate = 5)
  X2 <- rnorm(n, mean = 0, sd = 0.2)
  epsilon <- rnorm(n)
  Y <- 1.5 + 1 * X1 - 0.25 * X2 + epsilon
  fit <- lm(Y ~ X1 + X2)

  beta_hat_1 <- coef(fit)["X1"]
  beta_hat_2 <- coef(fit)["X2"]

  vcov_fit <- vcov(fit)
  s_cov_beta_hat_1_2 <- vcov_fit[2, 3]
  s_corr_beta_hat_1_2 <- vcov_fit[2, 3] / (sqrt(vcov_fit[2, 2]) * sqrt(vcov_fit[3, 3]))

  fit_unadj_X1 <- lm(Y ~ X1)
  fit_unadj_X2 <- lm(Y ~ X2)
  beta_hat_unadj_1 <- coef(fit_unadj_X1)["X1"]
  beta_hat_unadj_2 <- coef(fit_unadj_X2)["X2"]

  list(
    beta_hat_1 = beta_hat_1,
    beta_hat_2 = beta_hat_2,
    s_cov_beta_hat_1_2 = s_cov_beta_hat_1_2,
    s_corr_beta_hat_1_2 = s_corr_beta_hat_1_2,
    beta_hat_unadj_1 = beta_hat_unadj_1,
    beta_hat_unadj_2 = beta_hat_unadj_2
  )
}

d

num_simulations <- 2000
n <- 100
set.seed(13)
beta_hat_1_mlr <- numeric(num_simulations)
beta_hat_2_mlr <- numeric(num_simulations)
beta_hat_1_unadj <- numeric(num_simulations)
beta_hat_2_unadj <- numeric(num_simulations)
s_cov_beta_hat_1_2 <- numeric(num_simulations)
s_corr_beta_hat_1_2 <- numeric(num_simulations)

for (i in 1:num_simulations) {
  results <- simulate_betas_independent(n)
  beta_hat_1_mlr[i] <- results$beta_hat_1
  beta_hat_2_mlr[i] <- results$beta_hat_2
  beta_hat_1_unadj[i] <- results$beta_hat_unadj_1
  beta_hat_2_unadj[i] <- results$beta_hat_unadj_2
  s_cov_beta_hat_1_2[i] <- results$s_cov_beta_hat_1_2
  s_corr_beta_hat_1_2[i] <- results$s_corr_beta_hat_1_2
}

avg_beta_hat_1_mlr <- mean(beta_hat_1_mlr)
avg_beta_hat_2_mlr <- mean(beta_hat_2_mlr)
avg_beta_hat_1_unadj <- mean(beta_hat_1_unadj)
avg_beta_hat_2_unadj <- mean(beta_hat_2_unadj)
avg_s_cov_beta_hat_1_2 <- mean(s_cov_beta_hat_1_2)
avg_s_corr_beta_hat_1_2 <- mean(s_corr_beta_hat_1_2)

true_beta_1 <- 1
true_beta_2 <- -0.25

cat("Average beta_hat_1 from MLR:", avg_beta_hat_1_mlr, "\n")
## Average beta_hat_1 from MLR: 0.9886126
cat("Average beta_hat_2 from MLR:", avg_beta_hat_2_mlr, "\n")
## Average beta_hat_2 from MLR: -0.2352616
cat("Average beta_hat_1 from unadjusted SLR:", avg_beta_hat_1_unadj, "\n")
## Average beta_hat_1 from unadjusted SLR: 0.9884525
cat("Average beta_hat_2 from unadjusted SLR:", avg_beta_hat_2_unadj, "\n")
## Average beta_hat_2 from unadjusted SLR: -0.2383645
cat("True beta_1:", true_beta_1, "\n")
## True beta_1: 1
cat("True beta_2:", true_beta_2, "\n")
## True beta_2: -0.25
cat("Average covariance of beta_hats:", avg_s_cov_beta_hat_1_2, "\n")
## Average covariance of beta_hats: 0.00062601
cat("Average correlation of beta_hats:", avg_s_corr_beta_hat_1_2, "\n")
## Average correlation of beta_hats: 0.002019909

Now, the SLR estimates \(\hat{\beta_1} = 0.988\), \(\hat{\beta_2} = -0.238\); the MLR estimates: \(\hat{\beta_1} = 0.989\), \(\hat{\beta_2} = -0.235\) are roughly equivalent to each other and are both close to the true \({\beta_1} = 1\), \({\beta_2} = -0.25\). Also, \(E[cov(\hat{\beta_1}, \hat{\beta_2})] = 0.0006\), \(E[corr(\hat{\beta_1}, \hat{\beta_2})] = 0.002\), which are roughly 0. The results are as expected: as \(X_1\) and \(X_2\) are independent, the covariance is expected to be 0, and eliminating one variable should have no effect on predicting the coefficient of the other variable.

III

1

a

A #### b B #### c

load("nmes.rdata")
d3 <- nmes
d <- subset(d3, lastage==65 & eversmk %in% c(0,1))
d$ever <- as.numeric(d$eversmk)
table(d$ever)
## 
##   0   1 
## 142 163
t_test3 <- t.test(totalexp ~ ever, data = d, var.equal = T)
t_test3
## 
##  Two Sample t-test
## 
## data:  totalexp by ever
## t = -2.0937, df = 303, p-value = 0.03712
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -4348.020  -134.757
## sample estimates:
## mean in group 0 mean in group 1 
##        2092.803        4334.192
anova_3 <- aov(totalexp ~ ever, data = d)
summary(anova_3)
##              Df    Sum Sq   Mean Sq F value Pr(>F)  
## ever          1 3.813e+08 381250445   4.384 0.0371 *
## Residuals   303 2.635e+10  86972246                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm_3 <- lm(totalexp ~ ever, data = d)
summary(lm_3)
## 
## Call:
## lm(formula = totalexp ~ ever, data = d)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4334  -3629  -1885   -723 108723 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   2092.8      782.6   2.674   0.0079 **
## ever          2241.4     1070.5   2.094   0.0371 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9326 on 303 degrees of freedom
## Multiple R-squared:  0.01426,    Adjusted R-squared:  0.01101 
## F-statistic: 4.384 on 1 and 303 DF,  p-value: 0.03712
res_df <- summary(anova_3)[[1]]["Residuals", "Df"]
res_ss <- summary(anova_3)[[1]]["Residuals", "Sum Sq"]
res_se_anova <- sqrt(res_ss/res_df)
res_se_anova
## [1] 9325.891

The square root of the mean sqaured error form the ANOVA is 9325.891, which is equivalent to the residuals standard error from the lm function.

For ever, \(t-statistic = 2.094\), \((t-statistic)^2 = 4.385 = F-statistic\). The p-values are both 0.0371.

2

a

d$X <- ifelse(d$current == "1", 2,
ifelse(d$former== "1", 1,
ifelse(d$current == "." & d$former== ".", NA, 0)))
d <- d[!is.na(d$X), ]
table(d$X)
## 
##   0   1   2 
## 142 105  53
anova_fit <- aov(totalexp~ factor(X), data = d)
summary(anova_fit)
##              Df    Sum Sq   Mean Sq F value Pr(>F)  
## factor(X)     2 4.161e+08 208031361   2.351  0.097 .
## Residuals   297 2.628e+10  88474011                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm_fit <- lm(totalexp~ factor(X), data = d)
summary(lm_fit)
## 
## Call:
## lm(formula = totalexp ~ factor(X), data = d)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4452  -3663  -1881   -722 108605 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   2092.8      789.3   2.651  0.00845 **
## factor(X)1    2358.2     1210.6   1.948  0.05237 . 
## factor(X)2    2359.6     1514.1   1.558  0.12018   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9406 on 297 degrees of freedom
## Multiple R-squared:  0.01559,    Adjusted R-squared:  0.008958 
## F-statistic: 2.351 on 2 and 297 DF,  p-value: 0.09701

The linear model is: \(Y_i = \beta_0 + \beta_1 I_{former,i} + \beta_2 I_{current,i} + \epsilon_i\)

where: - \(I_{former,i}\) is an indicator variable for the “former” group (X = 1).

  • \(I_{current,i}\) is an indicator variable for the “current” group (X = 2).

  • For never smokers, \(I_{former,i} = I_{current,i} = 0\): \(\mu_{\text{never}} = E[Y|never] = \beta_0\)

  • For former smokers, \(I_{former,i} = 1, I_{current,i} = 0\): \(\mu_{\text{former}} = E[Y|former] = \beta_0 + \beta_1\)

  • For current smokers, \(I_{former,i} = 0, I_{current,i} = 1\): \(\mu_{\text{current}} = E[Y|current] = \beta_0 + \beta_2\)

The null hypothesis: \(H_0: \mu_{\text{never}}=\mu_{\text{former}}=\mu_{\text{current}}\) is equivalent to \(\beta_0 = \beta_0 + \beta_1 = \beta_0 + \beta_2\); equivalent to \(0 = \beta_1 = \beta_2\); equivalent to \(H_0:\beta_1 = 0\), and \(\beta_2 = 0\).

d

The ANOVA result has a F-statistics of 2.35 with P-value of 0.097. We failed to reject the null \(H_0: \mu_{\text{never}}=\mu_{\text{former}}=\mu_{\text{current}}\) at the significance level of 0.05, and conclude that there is no significant differences in mean total medical expenditures between the never, former, and current smokers.

IV

d <- subset(nmes, lastage >= 65 & eversmk %in% c(0, 1))
d$agem65 <- d$lastage- 65
d$age_sp1 <- pmax(0, d$lastage- 75) 
d$age_sp2 <- pmax(0, d$lastage- 85)
d$ever <- ifelse(d$eversmk== 1, 1, 0)
model_4 <- lm(totalexp~ agem65 + age_sp1 + age_sp2 + ever +
agem65:ever + age_sp1:ever + age_sp2:ever, data = d)
summary(model_4)
## 
## Call:
## lm(formula = totalexp ~ agem65 + age_sp1 + age_sp2 + ever + agem65:ever + 
##     age_sp1:ever + age_sp2:ever, data = d)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9846  -3739  -2838   -882 171074 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2445.18     469.31   5.210 1.97e-07 ***
## agem65         161.72      73.38   2.204   0.0276 *  
## age_sp1       -102.24     140.94  -0.725   0.4682    
## age_sp2        546.81     257.02   2.127   0.0334 *  
## ever          1513.54     624.50   2.424   0.0154 *  
## agem65:ever   -140.64     100.12  -1.405   0.1602    
## age_sp1:ever   261.66     206.39   1.268   0.2049    
## age_sp2:ever  -964.30     463.21  -2.082   0.0374 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10040 on 4720 degrees of freedom
## Multiple R-squared:  0.008323,   Adjusted R-squared:  0.006852 
## F-statistic: 5.659 on 7 and 4720 DF,  p-value: 1.591e-06
summary_table <- data.frame(
Estimate = round(coef(model_4)[c("agem65", "age_sp1", "ever", "agem65:ever", "age_sp1:ever")], 3),
"95% CI" = sprintf("[%.3f, %.3f]",
confint(model_4)[c("agem65", "age_sp1", "ever", "agem65:ever", "age_sp1:ever"), 1],
confint(model_4)[c("agem65", "age_sp1", "ever", "agem65:ever", "age_sp1:ever"), 2])
)
knitr::kable(summary_table, col.names = c("Estimate", "95% CI"))
Estimate 95% CI
agem65 161.715 [17.865, 305.566]
age_sp1 -102.238 [-378.537, 174.062]
ever 1513.540 [289.230, 2737.850]
agem65:ever -140.640 [-336.914, 55.633]
age_sp1:ever 261.659 [-142.965, 666.283]

agem65: This coefficient measures the estimated effect of age on total medical expenditures for individuals who have never smoked and are between 65 and 75 years old. The positive estimate of 162 indicates that for every additional year of age above 65, the expected total medical expenditure increases by approximately 162 dollars. The 95% confidence interval of 17.9 to 305.57 dollars indicates that we can be 95% confident that the true effect falls within this range, indicating a positive association between age and medical expenditures for never smokers in this age group.

age_sp1: This coefficient measures the additional effect on medical expenditures for individuals aged between 75 and 85, accounting for any changes in the impact of age after 75 years. The negative estimate of −102 suggests a potential decrease in medical expenditures by 102 dollars for those in this age range. However, the 95% confidence interval of -378.537 to 174.062 which includes 0, indicates that the true effect could be positive or negative.

ever: This coefficient measures the estimated difference in medical expenditures between ever smokers and never smokers. The positive estimate of 1514 dollars indicates that ever smokers tend to have higher medical expenditures by about 1514 dollars compared to never smokers on average. The 95% confidence interval, which is excludes 0, suggests that we are confident in the positive effect of smoking history on medical expenditures.

agem65:ever: This interaction term measures how the effect of age on medical expenditures differs for ever smokers aged between 65 and 75. The estimate of −141 suggests that the rate of increase in expenditures with age is lower by 141 dollars for ever smokers compared to never smokers. However, the 95% confidence interval, which includes 0, indicates that we cannot be confident about the significance of the negative effect.

age_sp1:ever: This interaction term measures how the effect of being an ever smoker influences the rate of medical expenditures for individuals aged between 75 and 85. The estimate of 262 suggests a possible greater rate of increase in expenditures by 262 dollars per additional year of age for ever smokers. However, the 95% confidence interval includes 0, indicates that we cannot be confident about the significance of the positive effect.

grid_4 <- expand.grid(
age = seq(65, 94, by = 1),
ever = c(0, 1))

grid_4$agem65 <- grid_4$age- 65
grid_4$age_sp1 <- pmax(0, grid_4$age- 75)
grid_4$age_sp2 <- pmax(0, grid_4$age- 85)
grid_4$pred <- predict(model_4, newdata = grid_4)

ggplot(d, aes(x = lastage, y = totalexp, color = factor(ever))) +
geom_point(alpha = 0.2) +
geom_line(data = grid_4,
aes(x = age, y = pred, color = factor(ever)),
size = 1) +
labs(color = "Ever Smoker",
x = "Age",
y = "Medical Expenditure",
title = "Linear Fit of Medical Expenditure vs. Age") +
theme(legend.position = "bottom") +
theme_minimal()  
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

3

grid_never <- subset(grid_4, ever== 0, select = c("age", "pred"))
grid_ever <- subset(grid_4, ever== 1, select = c("age", "pred"))

diff_df <- data.frame(
age = grid_never$age,
diff = grid_ever$pred- grid_never$pred)

ggplot(diff_df, aes(x = age, y = diff)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
geom_line(size = 1, color = "black") +
labs(
title = "Difference in Mean Medical Expenditures vs. Age",
x = "Age",
y = "Difference in Predicted Means"
) +
theme_minimal()

Possible reasons:

  1. Big variability due to small sample. We have fewer data for individuals over 85 years and even fewer for smokers over 85 years, the prediction might not be accurate due to the limited sample sizes.

  2. Survivorship Bias. Those who ever smoked and live over 85 years might be genetically healthier than others, and therefore have lower medical expenditure.

4

\(total\_exp_i = \beta_0 + \beta_1 agem65_i + \beta_2 age\_sp1_i + \beta_3 age\_sp2_i + \beta_4ever_i + \beta_5(agem65_i \times ever_i)+ \beta_6(sp1_i \times ever_i) + \beta_7(sp2_i \times ever_i) + \epsilon_i\) The difference of mean total expenditures between ever smokers and non-smokers:

  • Age 65: difference = \(\beta_4\)
  • Age 75: difference = \(\beta_4 + 10\beta_5\)
  • Age 90: difference = \(\beta_4 + 25\beta_5 + 15\beta_6 + 5\beta_7\)
get_diff_se <- function(age) {
 x_agem65 <- age - 65
 x_age_sp1 <- max(0, age- 75)
 x_age_sp2 <- max(0, age- 85)
 x <- c(0, 0, 0, 0, 1, x_agem65, x_age_sp1, x_age_sp2) 
 diff <- sum(x * coef(model_4))
 var_diff <- as.numeric(t(x) %*% vcov(model_4) %*% x)
 se_diff <- sqrt(var_diff)
 ci_lower <- diff - qnorm(0.975) * se_diff
 ci_upper <- diff + qnorm(0.975) * se_diff
 return(c(diff = diff, se = se_diff, lwr = ci_lower, upr = ci_upper))
}

ages <- c(65, 75, 90)
res_l <- lapply(ages, get_diff_se)
tmp_df <- data.frame(Age = ages, do.call(rbind, res_l))

res_df <- data.frame(
Age = tmp_df$Age,
diff = sprintf("%.3f", tmp_df$diff),
se = sprintf("%.3f", tmp_df$se),
ci = sprintf("[%.3f, %.3f]", tmp_df$lwr, tmp_df$upr))

knitr::kable(res_df, align = "c", col.names = c("Age", "Estimated Difference in expenditures",
"Linear Model Std Error", "Linear Model 95% CI"))
Age Estimated Difference in expenditures Linear Model Std Error Linear Model 95% CI
65 1513.540 624.499 [289.544, 2737.536]
75 107.137 587.310 [-1043.969, 1258.242]
90 -2899.064 1672.993 [-6178.070, 379.942]

5

The ratio of mean total expenditures between ever smokers and non-smokers at age 65: \(R = \frac{\mu_{65,ever}}{\mu_{65,never}} = \frac{\beta_0+\beta_4}{\beta_0}\)

By Delta method: \(Var(\hat{\boldsymbol{R}}) = (\frac{\partial \boldsymbol{R}}{\partial \boldsymbol{\beta}})^T * Var(\hat{\boldsymbol{\beta}}) * (\frac{\partial \boldsymbol{R}}{\partial \boldsymbol{\beta}})\). $ = $ \[ \begin{pmatrix} -\beta_4/\beta_0^2 \\ 1/\beta_0 \end{pmatrix} \]

b0 <- coef(model_4)["(Intercept)"]
b4 <- coef(model_4)["ever"]

R <- (b0 + b4) / b0
dri <- c(- b4 / (b0^2), 1 / b0)
var_sub <- vcov(model_4)[c("(Intercept)", "ever"), c("(Intercept)", "ever")]
var_ratio_65 <- t(dri) %*% var_sub %*% dri
se_R <- sqrt(var_ratio_65)
CI_lower_65 <- R - qnorm(0.975) * se_R
CI_upper_65 <- R + qnorm(0.975) * se_R

ratio_results <- data.frame(
Age = 65,
Ratio = round(R, 3),
SE = round(se_R, 3),
CI = sprintf("[%.3f, %.3f]", CI_lower_65, CI_upper_65)
)
knitr::kable(ratio_results, align = "c", col.names = c("Age", "Ratio", "SE", "95% CI"))
Age Ratio SE 95% CI
(Intercept) 65 1.619 0.353 [0.926, 2.312]

6

boot_diff <- function(data, ind, age) {
d <- data[ind, ]
model <- lm(totalexp ~ agem65 + age_sp1 + age_sp2 + ever + ever:agem65 + ever:age_sp1 + ever:age_sp2, data = d)
x_agem65 <- age - 65
x_age_sp1 <- max(0, age - 75)
x_age_sp2 <- max(0, age - 85)
diff <- coef(model)["ever"] +
coef(model)["agem65:ever"] * x_agem65 +
coef(model)["age_sp1:ever"] * x_age_sp1 +
coef(model)["age_sp2:ever"] * x_age_sp2
return(diff)
}

boot_diff_age <- function(data, age, R = 1000) {
set.seed(29)
boots <- boot(data, statistic = function(d, i) boot_diff(d, i, age), R = R)
ci <- boot.ci(boots, type = "perc")$percent[4:5] 
return(c(
se = round(sd(boots$t), 3),
ci = sprintf("[%.3f, %.3f]", ci[1], ci[2])
))
}

ages <- c(65, 75, 90)

# boot_results_diff <- t(sapply(ages, function(a) boot_diff_age(d, a, R = 1000)))
# save(boot_results_diff, file = "boot_results_diff.RData")
load("boot_results_diff.RData")

res_df_boot <- data.frame(
Age = ages,
"Bootstrap SE" = boot_results_diff[, "se"],
"Bootstrap 95% CI" = boot_results_diff[, "ci"]
)

knitr::kable(res_df_boot,
align = "c",
col.names = c("Age", "Bootstrap SE", "Bootstrap 95% CI")
)
Age Bootstrap SE Bootstrap 95% CI
65 565.109 [420.23, 2639.13]
75 598.935 [-1114.05, 1236.66]
90 2279.13 [-7050.60, 1979.84]
complete_df <- merge(res_df, res_df_boot, by = "Age")
knitr::kable(complete_df, align = "c",
col.names = c("Age",
"Estimated Difference",
"Linear Model SE",
"Linear Model 95% CI",
"Bootstrap SE",
"Bootstrap 95% CI")
)
Age Estimated Difference Linear Model SE Linear Model 95% CI Bootstrap SE Bootstrap 95% CI
65 1513.540 624.499 [289.544, 2737.536] 565.109 [420.23, 2639.13]
75 107.137 587.310 [-1043.969, 1258.242] 598.935 [-1114.05, 1236.66]
90 -2899.064 1672.993 [-6178.070, 379.942] 2279.13 [-7050.60, 1979.84]
  1. The bootstrapped standard errors are lower and the 95% confidence intervals are narrower at age 65 and 75, indicating that the data at ages 65 and 75 are more concentrated and consistent compared to the data with age 90. We see that both the bootstrapped standard errors and the confidence intervals do not deviate much from those calculated from linear regression. The narrower intervals also mean that we can be more confident in estimating the differences of the average expenditures at ages 65 and 75. The narrower intervals are likely due to a larger sample size at these age groups.

  2. At age 90, the bootstrapped 95% confidence interval is much wider compared to the interval from the linear model, with a larger standard error. This indicates a greater spread in the re-sampled estimates, indicating higher variability and more uncertainty. Because the linear model assumes homoscedasticity, it underestimated the true uncertainty at the extreme age. However, the bootstrapping method captured this increased variability possibly due to the fewer observations at age 90.

boot_ratio <- function(data, ind) {
d <- data[ind, ]
model <- lm(totalexp~ agem65 + age_sp1 + age_sp2 + ever + 
              ever:agem65 + ever:age_sp1 + ever:age_sp2, data = d)
ratio <- (coef(model)["(Intercept)"] + coef(model)["ever"]) / coef(model)["(Intercept)"]
return(ratio)
}

set.seed(29)

# boot_results_ratio <- boot(d, statistic = boot_ratio, R = 1000)
# save(boot_results_ratio, file = "boot_results_ratio.RData")
load("boot_results_ratio.RData")

ci_ratio <- boot.ci(boot_results_ratio, type = "perc")$percent[4:5]

res_r_boot <- data.frame(
Age = 65,
"Bootstrap Estimated Ratio" = (round((boot_results_ratio$t0), 3)),
"Bootstrap SE" = round(sd(boot_results_ratio$t), 3),
"Bootstrap 95% CI" = sprintf("[%.3f, %.3f]", ci_ratio[1], ci_ratio[2])
)
complete_r_df <- merge(ratio_results, res_r_boot, by = "Age")
knitr::kable(complete_r_df, align = "c", col.names = c("Age",
"Estimated Ratio",
"Linear Model SE",
"Linear Model 95% CI",
"Bootstrap Estimated Ratio",
"Bootstrap SE",
"Bootstrap 95% CI")
)
Age Estimated Ratio Linear Model SE Linear Model 95% CI Bootstrap Estimated Ratio Bootstrap SE Bootstrap 95% CI
65 1.619 0.353 [0.926, 2.312] 1.619 0.299 [1.149, 2.335]

The estimated ratio of medical expenditures for ever smokers compared to never smokers at age 65 is 1.62, indicating that, on average, for those at age 65, the medical expenditures for ever smokers will be 62% higher than never-smokers.

The bootstrapped method gives lower standard error and has narrower 95% confidence interval than the those given by the linear model, suggesting more reliable variance estimation. Also, the linear regression confidence interval includes 1. This indicates no significant difference in medical expenditures between ever-smokers and never-smokers at age 65.

model_null <- lm(totalexp~ agem65 + age_sp1 + age_sp2, data = d)
model_full <- lm(totalexp~ agem65 + age_sp1 + age_sp2 + ever +
ever:agem65 + ever:age_sp1 + ever:age_sp2, data = d)

null_ll <- logLik(model_null)
full_ll <- logLik(model_full)
ll_ratio <- -2 * (as.numeric(null_ll)- as.numeric(full_ll))
df_ratio <- attr(logLik(model_full), "df")- attr(logLik(model_null), "df")

lr_p_value <- pchisq(ll_ratio, df = df_ratio, lower.tail = FALSE)
f_test <- anova(model_null, model_full)

cat("Likelihood Ratio Test Statistic:", round(ll_ratio, 3))
## Likelihood Ratio Test Statistic: 12.309
cat("p-value for LRT:", round(lr_p_value, 5))
## p-value for LRT: 0.0152
f_test
## Analysis of Variance Table
## 
## Model 1: totalexp ~ agem65 + age_sp1 + age_sp2
## Model 2: totalexp ~ agem65 + age_sp1 + age_sp2 + ever + ever:agem65 + 
##     ever:age_sp1 + ever:age_sp2
##   Res.Df        RSS Df  Sum of Sq     F  Pr(>F)  
## 1   4724 4.7747e+11                              
## 2   4720 4.7622e+11  4 1241422709 3.076 0.01532 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The likelihood ratio test and the F-test test the significance in the mean medical expenditures between ever and never smokers across ages. The p-values for both tests are around 0.015, meaning that under significance level of 0.05, we reject the null and conclude that ever and never-smokers differ in the mean medical expenditures.

8

Objective

The aim of this study is to investigate whether elders who have smoked and those who have never smoked of the same age would use similar amounts of medical services. Using data from the National Medical Expenditure Survey, we analyzed total medical expenditures as a measure of medical service utilization and use different statistical approaches to compare the expenditures between ever and never smokers aged 65 and older.

Data

The data was retrieved from the 1987 National Medical Expenditure Survey (NMES), a national survey of healthcare expenditures in the United States conducted in 1987. The dataset includes information on medical expenditures and smoking history. Out of the total 13,648 subjects, the study focuses on 4,728 individuals with age greater than 65 and non-missing smoking history.

Method

The data were filtered based on eversmk (smoking status) lastage (subject age), with totalexp (self-reported total medical expenditures for 1987) being the outcome variable. ever , an indicator variable, was created to denote smoking status (0 for never smoker; 1 for ever smoker).

A multiple linear regression (MLR) model was fitted to estimate the difference in average medical expenditures between ever and never smokers as a function of age. The model includes two knots at ages 75 and 85 to account for non-linear relationships. To capture more complex patterns between the predictor variables, the interactions terms between age the smoking status were included to account for potential changes in the rate of medical expenditure for different smoking statuses across different age groups.

To address the skewness and heteroscedasticity of the data, bootstrapping method was used. Specifically, 1000 bootstrap samples were generated; the standard error and 95% confidence intervals were then obtained by the percentile method. The bootstrapped method was used to estimate the mean difference in medical expenditures and the ratio of expenditures at age 65 between ever and never smokers. The results from the bootstrapped method (standard error, 95% confidence intervals) were then compared with those from the linear regression model. Finally, likelihood ratio tests and F-tests were performed to evaluate whether the mean medical expenditures are significantly affected by the smoke history.

Result

The multiple linear regression model revealed that, regardless of smoking status, each additional year of age between 65 and 75 is associated with an average increase of $161.72 (95% CI: [17.87, 305.57]) in medical expenditures. For individuals aged 85 and older, this increase is much higher, with an average of $546.81 per year.

When smoking status is taken into account, ever-smokers have $1,513.54 (95% CI: [289.54, 2737.53]) higher medical expenditures compared to never-smokers on average. However, an unusual trend emerges after age 85, where ever-smokers have, on average, $964.30 lower medical expenditures than never-smokers. This trend may be caused by the variation due to the limited data available for this age group or the survivorship bias. The model’s low explanatory power, accounting for only 0.83% of the variation, suggests that other factors, such as environment, genetics, or other behavioral factors, may influence total medical expenditures in addition to age and smoking history.

Bootstrapping was used to assess age-dependent uncertainty in medical expenditure estimates. At age 65, the bootstrapped standard error of 565.11 is lower than the linear model’s 624.50, with the a bootstrapped 95% CI [420.23, 2639.13] that is narrower in coverage compared to linear model’s [289.544, 2737.536]. The confidence intervals that exclude 0 also mean that the medical expenditure estimates differ for never smokers and ever smokers at age 65.

At age 75, the bootstrapped standard error of 598.94 is slightly higher than the linear model’s 587.31, with the a bootstrapped 95% CI [-1114.05, 1236.66] compared to linear model’s [-1043.969, 1258.242]. The similar results from bootstrap method and linear regression method are likely due to larger sample sizes at these age groups which result in more precise estimates.

Conversely, at age 90, the bootstrap method gives a wider 95% CI [-7050.60, 1979.84] compared to the linear model [-6178.070, 379.942], along with a larger standard error. This indicates a greater spread in the re-sampled estimates, indicating higher variability due to the limited sample size.

Both the regression and the bootstrap method give the estimated ratios of medical expenditures for ever smokers compared to never smokers at age 65 to be 1.62, indicating that, on average, for those at age 65, the medical expenditures for ever smokers will be 62% higher than never-smokers. The bootstrapped method gives lower standard error (0.299) and has narrower 95% confidence interval ([1.149, 2.335]) than the those given by the linear model (0.353, [0.926, 2.312]), suggesting more reliable variance estimation. Also, because the confidence interval obtained by the linear regression method includes 1, it indicates a lack of significant difference in medical expenditures between ever-smokers and never-smokers at age 65. This conflicting result obtained by the regression method and the bootstrapping method is worth further investigation.

Finally, we perform hypothesis tests of whether the smoker history affects the the amount of quantity of medical services used. Both the F-test (𝐹 = 3.08, 𝑝 = 0.015) and the Likelihood Ratio Test (𝜒2 = 12.3, 𝑑𝑓 = 4, 𝑝 = 0.0152) rejected the null hypothesis at the 0.05 significance level. These results indicate that total medical expenditures after age 65 are significantly different between smokers and non-smokers.

Discussion

This analysis provides evidence that for those are above 65 years old, ever-smokers have significantly higher medical expenditures compared to never-smokers. The effect is most pronounced at age 65, as confirmed by both linear regression and bootstrapping methods. At older ages, such as 90, the regression method shows higher variability in average medical expenditures and a negative mean difference, which are likely due to limited number of observations and survivorship bias. Bootstrapped confidence intervals indicate that at these extreme ages, the standard linear model may underestimate the variability.

The linear regression model indicates an overall association of increasing medical costs with age. However, the model only explains 0.8% of the variation, suggesting that other unmeasured covariates, such as socio-economic status or other behavioral factors may play a role. Thus, further research should incorporate additional covariates to improve model precision.